build: Extend tokenizer capabilities #1114
Merged
msluszniak merged 1 commit into main on May 6, 2026
Conversation
78b5a13 to f1341d2
msluszniak reviewed on May 4, 2026
Collaborator
I'll take a look tomorrow.
Collaborator
Is there any particular tokenizer this should be tested with?
Member
Yes, unigram. You can test it by running the model from this PR: #1115
mkopcins approved these changes on May 6, 2026
msluszniak added a commit that referenced this pull request on May 7, 2026
…#1115)

## Description
Adds the `paraphrase-multilingual-MiniLM-L12-v2` sentence-transformer model, the second multilingual embeddings model after distiluse, completing #945. Ships **only the XNNPACK 8da4w variant** under `MODEL_REGISTRY.ALL_MODELS` (see "Why a single variant" below). 384-d output, max 126 tokens, 50+ languages. The tokenizer is Unigram + Precompiled normalizer + Metaspace decoder and **requires the bumped `pytorch/extension/llm/tokenizers` runtime from #1114**, so this PR blocks on that landing first and should be rebased onto main once #1114 merges. HF repo: [software-mansion/react-native-executorch-paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/software-mansion/react-native-executorch-paraphrase-multilingual-MiniLM-L12-v2) (`v0.9.0` tag, layout mirrors distiluse).

**Why a single variant**: TL;DR, 8da4w is faster than all the other variants and is also among the smallest, without a loss in precision. Longer answer: unlike distiluse, where Core ML fp32 won on iPhone thanks to ANE acceleration, benchmarks on iPhone 17 Pro + OnePlus 12 (~80-token input, 50 measured forwards after 3 warmups) showed that the XNNPACK 8da4w variant Pareto-dominates the other three on both platforms: it is faster than XNNPACK fp32, Core ML fp32 *and* Core ML fp16 on iPhone, and has a ~36% smaller steady-state memory footprint than the next-best variant. Likely cause: paraphrase-multilingual-MiniLM-L12-v2 is a smaller model (~118 M params, 12 layers) where Core ML's runtime doesn't push enough work onto ANE for the precision-conversion overhead to pay off. fp16 being slower than fp32 on Core ML for this model is a tell that the runtime is falling back to slower compute units. Shipping only `_8DA4W` keeps the public surface aligned with the data; if a future Core ML or model update flips the verdict, it is easy to add the other variants back.

**Memory methodology note**: the new paraphrase row in `docs/docs/02-benchmarks/memory-usage.md` reports RSS / `phys_footprint` deltas from a clean app baseline (loaded − idle), captured on-device at the same conceptual point. The existing distiluse rows there (36 / 44 MB) come from an older measurement pass with a different methodology (not reconstructable from the diff), so the two rows are not directly comparable. A separate pass to re-measure distiluse and the other rows with the same methodology would be a good follow-up.

### Introduces a breaking change?
- [ ] Yes
- [x] No

### Type of change
- [ ] Bug fix (change which fixes an issue)
- [x] New feature (change which adds functionality)
- [ ] Documentation update (improves or adds clarity to existing documentation)
- [ ] Other (chores, tests, code style improvements etc.)

### Tested on
- [x] iOS
- [x] Android

### Testing instructions
1. `cd apps/text-embeddings && npx expo run:ios` (or `run:android`).
2. Pick **"Multilingual Paraphrase (8da4w)"** in the model picker.
3. Add a sentence in one language, then query with an aligned sentence in another (e.g. Polish "Słoneczko" against "It's so sunny outside!"). The cross-lingual pair should top the matches.

### Related issues
Closes the paraphrase-multilingual half of #945 (the distiluse half landed in #1098).

### Checklist
- [x] I have performed a self-review of my code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have updated the documentation accordingly
- [x] My changes generate no new warnings

### Additional notes
Blocks on #1114.
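As a rough illustration of step 3 in the testing instructions above, the sketch below shows the kind of check the text-embeddings demo performs: embed the stored sentences and the query, rank by cosine similarity, and expect the aligned cross-lingual pair on top. The `embed` function is a placeholder for whatever call the demo app makes into the paraphrase-multilingual model; its name and shape are assumptions, only the ranking logic is spelled out.

```ts
// Rank stored sentences against a query by cosine similarity of their
// embeddings. `embed` is a hypothetical stand-in for the demo app's call
// into the paraphrase-multilingual model (384-dimensional output).
type EmbedFn = (text: string) => Promise<number[]>;

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function rankMatches(embed: EmbedFn, query: string, sentences: string[]) {
  const queryVec = await embed(query);
  const scored = await Promise.all(
    sentences.map(async (s) => ({
      sentence: s,
      score: cosineSimilarity(queryVec, await embed(s)),
    }))
  );
  // Highest similarity first; per the testing instructions, the aligned
  // cross-lingual pair ("Słoneczko" vs. "It's so sunny outside!") should rank first.
  return scored.sort((a, b) => b.score - a.score);
}
```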
Description
This PR introduces rebuilt binaries that contain new, updated tokenizers.
This iteration adds support for more tokenization models (e.g. Unigram, WordLevel) as well as a number of previously unsupported pre-tokenizers, decoders, and post-processors.
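To make the new capabilities concrete, the toy sketch below shows what a Unigram tokenization model does conceptually: it segments text into the sequence of known pieces with the highest total log-probability via a Viterbi-style search. This is a self-contained TypeScript illustration, not the actual C++ implementation in `pytorch/extension/llm/tokenizers`; the vocabulary and scores are made up for the example, and real tokenizers additionally apply normalization (e.g. the Precompiled normalizer) and Metaspace pre-tokenization/decoding.

```ts
// Toy Unigram segmentation: choose the piece sequence with the highest
// total log-probability. Vocabulary scores are illustrative only.
type Vocab = Record<string, number>; // piece -> log-probability

function unigramSegment(text: string, vocab: Vocab): string[] {
  const n = text.length;
  // best[i] = best total score for text[0..i); backPtr[i] = start index of the last piece
  const best = new Array<number>(n + 1).fill(-Infinity);
  const backPtr = new Array<number>(n + 1).fill(-1);
  best[0] = 0;

  for (let end = 1; end <= n; end++) {
    for (let start = 0; start < end; start++) {
      const piece = text.slice(start, end);
      if (piece in vocab) {
        const candidate = best[start] + vocab[piece];
        if (candidate > best[end]) {
          best[end] = candidate;
          backPtr[end] = start;
        }
      }
    }
  }

  // Walk the back-pointers to recover the winning segmentation.
  const pieces: string[] = [];
  for (let i = n; i > 0; i = backPtr[i]) {
    if (backPtr[i] < 0) throw new Error("text cannot be segmented with this vocabulary");
    pieces.unshift(text.slice(backPtr[i], i));
  }
  return pieces;
}

// "unigram" splits into the higher-scoring ["uni", "gram"] rather than single characters.
const vocab: Vocab = { u: -4, n: -4, i: -4, g: -4, r: -4, a: -4, m: -4, uni: -2.5, gram: -2.0 };
console.log(unigramSegment("unigram", vocab)); // -> [ 'uni', 'gram' ]
```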
Introduces a breaking change?
Type of change
Tested on
Testing instructions
Before merging, test all demo applications. Verify that all models that proved problematic during past bumps still work (e.g. Kokoro, multi-method models).
Check all LLM models and verify that their output is correct (a minimal smoke-test sketch follows below).
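The sketch below shows the kind of manual check meant here: run one generation per model and flag output that is empty or contains decoding artifacts. The `generate` parameter is a placeholder for whatever text-generation entry point the demo app exposes; only the checks themselves are spelled out.

```ts
// Minimal manual smoke test for LLM output after the tokenizer bump.
// `generate` stands in for the demo app's generation call; the checks
// only assume it resolves to the generated string.
type GenerateFn = (prompt: string) => Promise<string>;

export async function smokeTestLLM(modelName: string, generate: GenerateFn): Promise<void> {
  const output = await generate("Reply with a short greeting.");

  // Empty output or U+FFFD replacement characters usually point at a
  // tokenizer/decoder regression rather than a model problem.
  if (output.trim().length === 0) {
    throw new Error(`${modelName}: empty output`);
  }
  if (output.includes("\uFFFD")) {
    throw new Error(`${modelName}: output contains replacement characters (decoder issue?)`);
  }
  console.log(`${modelName}: OK -> ${output.slice(0, 80)}`);
}
```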
Screenshots
Related issues
Checklist
Additional notes